Open and Reproducible Soil Science

With great data comes great responsibility!

Dr. Alex Koiter

alexkoiter.ca
@alex_koiter
koitera@brandonu.ca

Outline

From field/lab to dissemination and everything in between!

Artwork by Allison Horst

Definitions

Reproducibility

  • Given the same data set the analyses can be reproduced

Replicability

  • A separate investigator conducted an independent study and came to the same conclusion as the original study

Reproducibility

  • Less about ensuring the correctness of the results
  • More about being transparent and understanding exactly what was done
    • This is especially important in large and complex data sets

Important

A study can be reproducible and still be wrong

Open data and science

  • Making the data set, methods, and interpretation:
    • Available
    • Accessible
    • Transparent
    • Reproducible

The six core principles of Open Science; Gallagher et al. 2020 Nat Ecol Evol

Computational reproducibility

Important because:

  • Allows us to evaluate the data, analyses, and models on which conclusions are drawn
  • Allows you to revisit your own work (e.g., incorporate a suggestion)

It is difficult to reproduce because:

  • Data is not made available/open
  • Method sections of papers often do not provide enough detail
  • Use of graphical programs (clicks and drop-down menus) that leave no record of the steps taken
  • Not making code available/open (R, Python, MATLAB)

Working with large data sets

Data acquisition

  • Documenting getting/downloading/importing data sets
    • Always maintain the original data (unmodified)

Example

  • Downloading weather data for a few different stations over many years

LaZerte & Albers 2018; JOSS
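One way to document an acquisition step and confirm that the original file stays unmodified is to log a checksum when the data is first downloaded. The sketch below is a minimal illustration using only the Python standard library; the function names, the log format, and the `source` label are assumptions made for this example and are not taken from the cited package.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_provenance(raw_file: Path, source: str, log_file: Path) -> dict:
    """Append one provenance entry (source, time, checksum) to a JSON-lines log."""
    entry = {
        "file": raw_file.name,
        "source": source,  # free-text description of where the data came from
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256_of(raw_file),
    }
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def is_unmodified(raw_file: Path, entry: dict) -> bool:
    """Check that the raw file still matches the checksum recorded at download."""
    return sha256_of(raw_file) == entry["sha256"]
```

Calling `record_provenance` once per downloaded file and `is_unmodified` at the start of each analysis run gives a simple check that all later work starts from the same original data.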

Working with large data sets

Data munging/wrangling

  1. Formatting
  2. Merging
  3. Quality assurance
    • NA’s
    • 0’s
    • Detection limits
    • Outliers
    • Typos/errors

Need to document every change you make
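Doing the quality assurance in a script is one way to document every change automatically. The sketch below is illustrative only: the function name, the example values, and the rule of replacing values below the detection limit with half the detection limit are assumptions for this example, not a recommendation from the talk.

```python
import math


def clean_concentrations(values, detection_limit):
    """Return cleaned values plus a log of every change or flag.

    Assumed rules (illustrative only): missing values stay missing, and values
    below the detection limit are replaced with half the detection limit.
    """
    cleaned, log = [], []
    for index, value in enumerate(values):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            cleaned.append(None)
            log.append((index, value, None, "missing value kept as missing"))
        elif value < detection_limit:
            new_value = detection_limit / 2
            cleaned.append(new_value)
            log.append((index, value, new_value, "below detection limit"))
        else:
            cleaned.append(value)
    return cleaned, log


raw = [12.4, 0.0, None, 0.3, 8.1]  # made-up example values
cleaned, changes = clean_concentrations(raw, detection_limit=0.5)
for index, old, new, reason in changes:
    print(f"row {index}: {old!r} -> {new!r} ({reason})")
```

The original list is never altered, and the change log can be saved alongside the cleaned data so the decisions can be reviewed or reversed later.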

Data analysis

Analysis/figures

  • Data used
  • Analysis used (trial and error)
  • Parameters
  • Diagnostics
  • Figure creation process
  • Software versions

Hard to write papers if you don’t keep track of this!
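Software versions are one of the easier items on this list to capture automatically. The sketch below shows one possible approach in Python using the standard library; the function name `session_info` and the package names passed to it are placeholders for this example.

```python
import platform
from importlib import metadata


def session_info(packages):
    """Collect interpreter, OS, and package versions for the analysis record."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["packages"][name] = "not installed"
    return info


# The package names here are placeholders; list whatever the analysis imports.
print(session_info(["numpy", "pandas"]))
```

Saving this output next to the results makes it possible to report the exact versions in a methods section later.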

Open data and science

Important because:

  • Improves global synthesis of knowledge
  • More robust and reliable science
  • Easier to build on existing work
  • Better collaboration
  • Improves collegiality
  • Promotes EDI
  • Community engagement

It is difficult to be open because:

  • Can be expensive
  • Can be time consuming
  • Lack the skills
  • Unsure of what might happen

Sharing is caring

How to make your data open

  • Where to host it
  • When to make it available
  • Removing sensitive data
  • Format
  • For how long
  • Raw or summarized data
  • Attribution

Sharing is caring

How to make your science open

  • Open access publications or preprints
  • Open peer review
  • Detailed methods
  • Open source software
  • Make code available
  • Make data available
  • Study preregistration

Why is this not practiced?

  1. Don’t know how - learn! lots of support and tools
  2. Too busy - often faster in the long run
  3. It’s internal work - often a need to share
  4. Worried about being copied - in practice low risk
  5. Worried about mistakes - happens to everyone
  6. Rigged the data - you have bigger problems

How can you achieve this?

Reproducible data analysis

  • Extensive notes
    • What, when, with what
  • Programmatically
    • Scripts, R, Python, MATLAB, etc.
  • Version control

Recent success story

  • Collaborating with Ehsan Zarrinabadi on soil erosion project
  • Creating figures with a large data set

Recent success story

  • Ehsan identifies an issue where a site has samples where % sand + % silt + % clay \(\neq\) 100 %
    • Data is nonsensical
  • Requests my help to resolve
  • Because Ehsan has a reproducible workflow (R script), he sends me the data and script
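A check like the one Ehsan ran can be written as a few lines of code, so it runs every time the data is loaded. The sketch below is in Python rather than the R used in the project; the field names, the tolerance of 0.5 percentage points, and the sample values are all made up for illustration.

```python
def flag_texture_errors(samples, tolerance=0.5):
    """Return (id, total) for samples whose sand + silt + clay is not 100 % within tolerance."""
    flagged = []
    for sample in samples:
        total = sample["sand"] + sample["silt"] + sample["clay"]
        if abs(total - 100.0) > tolerance:
            flagged.append((sample["id"], total))
    return flagged


# Made-up values for illustration only
samples = [
    {"id": "A1", "sand": 40.0, "silt": 40.0, "clay": 20.0},
    {"id": "A2", "sand": 55.0, "silt": 30.0, "clay": 25.0},
]
print(flag_texture_errors(samples))
```

A small tolerance allows for rounding in the lab results while still catching samples whose fractions clearly do not add up.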

Recent success story

  • Working with an AAFC collaborator (Dr. Sheng Li)
  • Asked to collect measurements and process the data
  • I wanted to have a reproducible data analysis workflow as a deliverable

Quarto

Recent success story

  • Loading in the data
    • Formatting of the data was odd to accommodate field note taking

Recent success story

  • Calculations and output
    • Very clear as to what was done

Recent success story

  • Reproducible
    • Data

Recent success story

  • Reproducible
    • Software versions

How can you achieve this?

Open data and science

  • Open and free software

Recent success story

Colour analysis

  • Worked on several projects where the colour of soil and sediment needed to be characterized
    • Also collaborated on several projects with the same goal
  • Methods of analysis were well established in the literature
    • Math was a bit complicated and used lots of data
    • MATLAB script was available (not free)
  • Developed an R script

Credit: Masoud

Recent success story

GitHub

  • Was emailing the script to collaborators
  • Got tired of emailing
  • Changes are tracked through commits
  • Have added scripts to create figures
  • Most up to date version easily available
  • Has example data
  • Has a copyright license
  • Has a DOI
  • Easy for others to contribute via Pull requests
  • Can open issues (bugs)

Ongoing success story

Wind erosion project

  • Remote sensing of soil surface properties
  • Involves lots of imagery and other large files
    • Emailing not an option
  • Wanted a central repository for:
    • Raw data
    • Data analysis scripts
    • Processed data (results)
    • Publications
  • Good candidate to try out OSF
  • Currently a private repository

Closing thoughts

Being reproducible and open makes science stronger

  • One size does not fit all
    • Lots of different approaches and challenges
  • It is not all or nothing
    • Anything you do is awesome
  • It takes time and effort to learn and implement
    • Everybody has a different workload and priorities
    • Great support and resources are available
  • It can make you feel vulnerable
    • Mistakes happen
    • Can only be fixed if found
    • This is not a sign of weakness - hiding or not learning from them is